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Abstract 

ChlP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the 
genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of 
novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChlP-enriched 
sequencing data. State-of-the-art heuristic, exhaustive search algorithms have limited application for the identification of 
short (/, d) motifs (/<10, d<2) contained in ChlP-enriched regions. In this work we have developed a more powerful 
exhaustive method (FMotif) for finding long (/, d) motifs in DNA sequences. In conjunction with our method, we have 
adopted a simple ChlP-enriched sampling strategy for finding these motifs in large-scale ChlP-enriched regions. Empirical 
studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChlP-seq 
data sets and five TF ChlP-exo data sets have demonstrated that our proposed method is capable of finding these motifs 
with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/. 
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Introduction 

Protein-DNA interactions play key roles in several cellular 
processes and functions including DNA transcription, packaging, 
replication, and repair. Identification of regions such as transcrip- 
tion factor binding sites (TFBSs), which are targeted by proteins 
called transcription factors (TFs), is crucial for a better under- 
standing of transcriptional regulation. Although traditional 
footprinting assays can accurately identify the precise binding 
sites of any factor, this low-throughput method is highly technical 
and can only be used to analyze a single small region ( < 1 kilobase 
pairs (kb)) at a time. Chromatin immunoprecipitation followed by 
high-throughput deep sequencing (ChlP-seq) enables genome- 
wide detection of transcription factor binding sites as well as the 
localization of epigenetic regulatory markers on a genomic scale 
[1,2]. It typically returns millions of short (35-50 base pairs (bps)) 
sequence tags mapped onto a reference genome from a sample 
organism. Putative binding sites with high confidence can be 
extracted from peak-enriched regions in the genome by peak- 
calling programs [3] . However, the resolution of binding regions 
identified from ChlP-seq can be a few hundred base pairs and is 
one or two orders of magnitude larger than a typical TFBS. By 
using an exonuclease that trims DNA regions at a precise distance 
from binding sites, the novel ChlP-seq technique ChlP-exo is able 
to locate binding sites at high resolution [4]. However, according 
to the results in Rhee and Pugh [4], binding regions identified 



from ChlP-exo experiments may be tens of bps away from the 
exact binding locations, although some of them at the location 
indicated by the experiments. Computational methods are still 
needed to identify the exact binding locations of a TF in ChlP-seq 
or ChlP-exo data sets. 

Binding sites for a specific TF are often highly conserved and 
have strong evidence for sequence specificity [5] . An actual DNA 
region interacting with and bound by a single TF usually ranges in 
size from 8-10 to 16-20 bps. In the past two decades, numerous 
programs have been developed to identify over-represented DNA 
sequence motifs from the promoters of co-regulated or homolo- 
gous genes [6]. These programs can be divided into two groups. 
The first includes profile-based methods such as CONSENSUS 
[7], MEME [8], Gibsampler [9], AlignACE [10], PROJECTION 
[11], and CRMD [12], each of which attempts to maximize a 
statistic- or entropy-related score from a profile matrix (also called 
a position weight matrix (PWM)). The second group is comprised 
of consensus-based methods, which include SPELLER [13], 
WEEDER [14,15], MITRA-count [16], Voting [17], PMSprune 
[18], WINNOWER [19], iTriplet [20], VINE [21], Stemming 
[22], and RecMotif [23]. These progams are designed to find 
potential (l,d) motifs within DNA sequences [19], where / is the 
length of a motif and d is the maximum number of mutations 
between a predicted binding site and the motif consensus. In most 
cases, profile-based methods are faster but suffer from lower 
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accuracy due to their tendency to be trapped in a local optimum. 
Consensus-based methods are more accurate but slower due to the 
exponential growth of the search space with increasing values of / 
and d. 

Consensus-based methods can be further divided into two 
categories: pattern-driven and sample-driven approaches [16]. A 
pattern-driven approach attempts to enumerate all possible 4' /- 
mer motifs with lexical order, while a sample-driven approach tries 
to test all possible (/, d) motifs generated from real /-mers of input 
sequences. For the methods mentioned above, SPELLER, 
WEEDER, and MITRA-count are pattern-driven approaches 
and Voting, PMPprune, WINNOWER, iTriplet, VINE, Stem- 
ming, and RecMotif are sample-driven. By using pattern-driven 
approaches (with the exception of MITRA-count), one can 
automatically find planted (/, d) motifs without prior knowledge 
of the length /. On the contrary, sample-driven approaches require 
that / be specified for each run. In real applications, the exact 
length of motifs contained in a set of sequences is usually unknown. 
The pattern-driven algorithm WEEDER has been successful in 
real eukaryotic applications [24] but has not been improved upon 
to the best of our knowledge. In this study, we have developed a 
more powerful method to extract (l,d) motifs and their binding 
locations contained in DNA sequences without prior knowledge of 
motif length and have used this method to identify motifs and their 
binding locations in ChlP-enriched regions. 

The pattern-driven approach MITRA-count builds a mismatch 
tree for all /-mers first, then traverses search space recursively from 
the root down in depth-first order. Therefore, the length / of a 
predicted motif must be specified in advance. SPELLER 
enumerates all possible motifs in a depth-first manner throughout 
the search space, then scans and counts all possible instances of the 
current motif with length i (ie{\,2, ■ ■ ■ ,/}) from the suffix tree of 
input sequences. The algorithm can identify planted (/, d) motifs 
efficiently when /< 13 and d< 3 (see Table 1). In order to increase 
the speed of SPELLER, WEEDER includes an error ratio e 
(e~d/l) for the algorithm that narrows the search space such that 
for all !e{l,2,...,/} the number of mismatches between the first i 
nucleotides of a candidate /-mer motif and the first i nucleotides of 
a valid instance of the motif is at most si. The algorithm can 
accelerate SPELLER to some extent, especially when d // is small 
(e.g., fi?// < 0.25). Unfortunately, not all motif occurrences satisfy 
this restriction. WEEDER must lower the occurrence frequency 
q<N to make sure exact motifs will not be missed. However, 
WEEDER's run time increases dramatically with the decrease of 
q. For instance, for (15, 4), q should be lowered to half the number 
of sequences at the OOPS constraint (one occurrence(s) of the 
motif instance(s) p_er sequence) to make sure that the true motif will 
be discovered [14]. However, WEEDER's run time may be even 
longer than SPELLER under the condition that the two 
algorithms use the same programming techniques. Thus, a more 
efficient method is needed to improve the efficiency of pattern- 
driven algorithms without knowledge of the length of predicted 
motifs under the ZOMOPS constraint (zero, one or multiple 
occurrence(s) of the motif instance(s) p_er sequence). 

Additionally, the programs mentioned above are not compu- 
tationally efficient enough to process a large number of ChlP-seq 
peaks. In recent years, several programs have been developed to 
cope with large-scale ChlP-seq data. Some are ChlP-tailored 
versions of previously-developed software (e.g., ChlP-MEME [25], 
DREME [26], and GimmeMotifs [27]). These typically restrict 
motif discovery to a few hundred peaks and usually ignore the 
remaining unselected sequences. Other programs are faster 
versions of previous software (e.g., STEME [28], ChlPMunk 
[29], and HMS [30]). STEME is a faster version of MEME and 



Table 1. Comparisons between FMotif and other pattern- 
driven algorithms on (/, d) samples with A = 20, L = 600, and 
a = 0% noise sequences. 
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4.42m- 1 
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15.51m-/ 


(16, 5); t = 0.31 






41.19m-l 


15.50m-1 


(17, 6); T = 0.35 






6.58A-1 


3.1 lh-1 


(18, 6); t = 0.35 






6.84/?-1 


3.17A-1 



'WEEDER(f/)' indicates the execution time of WEEDER given the occurrence 
frequency threshold q. ' — ' indicates a run time of over 10 hours, s, m, and h are 
the units of a run time and denote seconds, minutes, and hours respectively. 
The number after each run time is the ranking number of a true planted motif 
among the top 25 predicted motifs. '/' after a run time indicates that the real 
motifs were not in the top 25. 
doi:1 0.1 371 /journal.pone.0086044.t001 

involves indexing sequences with a suffix tree, which accelerates 
the expectation-maximization (EM) steps. ChlPMunk combines 
EM with a greedy approach similar to CONSENSUS and 
decreases the run time of the optimization procedure. HMS is an 
improved version of Gibbs Sampler and combines stochastic 
sampling with deterministic, greedy search steps. Another group of 
programs integrate other information such as TFBS positional 
priors [31] or transcription start sites [32] in order to optimize a 
PWM of ChlP-enriched regions. As mentioned above, these 
programs still have a local optimum problem. Similar to 
SPELLER and WEEDER, some of these programs are consen- 
sus-based methods (sometimes called word enumeration methods). 
These include RSAT [33], Cisfinder [34] and POSMO [35]. 
RSAT is a word enumeration method and has been developed to 
process whole ChlP-seq peak data sets, but is limited to short (/, d) 
motifs (I <\0,d <2). Cisfinder is a word clustering method and 
combines short k-mer enumeration (k = 7, 8, or 9) with a 
clustering strategy. POSMO, also a word clustering method, uses 
TFBS positional bias information along with k-mev enumeration 
and clustering. However, both Cisfinder and POSMO use 
clustering methods to group short k-mers and therefore cannot 
find exact (/, d) motifs contained in sequences. Thus, finding exact 
(/, d) motifs with larger values of / and d in a large-scale sequence 
data set is still very difficult. 

According to a previous study by Keich and Pevzner [36], real 
signals may be mixed with spurious motifs contained in 
background sequences under the OOPS constraint when the 
degenerative ratio T = d/l>0.25. Alarger x makes it more difficult 
to discriminate between a real motif and spurious motifs. 
However, some sequences may not contain any occurrence of a 
motif. As previously mentioned, we have concentrated on a more 
generalized model (the ZOMOPS constraint). Under this 
constraint, we have found that, except for the degenerative ratio 
t, the ratio of noise sequences a. = (N — Q)/N, where N is the 
number of sequences and Q is the number of sequences containing 
at least one variant of a motif, negatively affects (/, d) motif 
searches. A larger a leads to more spurious motifs in background 
sequences. It is suspected that 30% of factor-bound locations in 
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ChlP-seq data may be false positives [4]. Plus, there may be 
different versions of DNA-binding motifs for any given TF. A 
specified motif may only occur in 30% of binding regions. 
Although false positive rates in ChlP-seq data sets are low enough 
that statistical conclusions can be drawn in most cases, the noise 
(plus the diversity of DNA-binding modes) still interrupts the 
motif-finding process and alters motif-finding results. Thus, this 
may not be the best way to identify motifs in full-size ChlP-seq 
data sets. After running a peak-calling program on a raw ChlP-seq 
data set, peaks along with their ChIP enrichment values, ^-values, 
or false discovery rates (FDRs) can be obtained. False positive 
peaks are those with low peak enrichment values, /"-values, or 
FDRs. A better method may be to find motifs with a high 
confidence value (i.e., those that are plentiful enough to draw 
statistical conclusions) in peak-enriched regions and subsequendy 
scan their binding locations with the degenerative value d in the 
remaining peak regions that have low peak enrichment values, 
p-values, or FDRs. This would not only exclude more noise and 
spurious motifs [37], but it would also take advantage of well- 
developed motif-finding tools with an acceptable level of 
scalability. A similar idea was used in MICSA and achieved good 
performance [38]. However, MICSA used the optimal method 
MEME (the accuracy of which is limited [12,19]) to get the PWM 
of a motif for only the first three hundred peak-enriched regions. 

In this study we have found that for motifs with length /, both 
SPELLER and WEEDER have been designed to check each 
/-mer (!</) in the pattern space with depth-first order and count 
the variants of the /-mer in the suffix tree of sequences from the 
root to layer i. The suffix tree is scanned one time for each i-mer 
pattern. Thus, as i increases, the algorithms scan the suffix tree an 
increasing number of times. In fact, the mismatch information in 
layer i of a suffix tree can be used to search for (i+ l)-mers in the 
pattern space. For this reason, we constructed a new suffix tree 
structure with mismatch information (called a mismatched suffix 
tree) and developed a fast motif enumerative method (FMotif) 
under the ZOMOPS constraint. Using the newly constructed 
suffix trees, we incorporated the mismatch information in layer i of 
the mismatched suffix trees to verify (i+l)-mers in the pattern 
space. We then updated mismatch information in layer i+ 1 of the 
mismatched suffix trees. In this way we were able to implement a 
depth-first search within the pattern space and the mismatched 
suffix trees simultaneously. To process large-scale ChlP-seq data 
sets, we integrated the peak detection method MACS [39] with 
our motif-finding method and ChlP-enriched sampling strategy, 
which allowed us to locate the exact binding locations in ChlP-seq 
and ChlP-exo data sets. We chose MACS because it has been 
shown to perform well when compared to several other peak- 
calling programs [3]. 

Results 

Experimental Results on Artificial Data Sets 

We compared FMotif with the existing pattern-driving methods 
including SPELLER, WEEDER, and MITRA-count (MITRA for 
short) on synthetic samples to show the efficiency of our proposed 
method. All synthetic samples were generated following the 
method of Pevsner and Sze [19], where Q (Q<N) variants of 
an /-length motif were randomly planted into Q sequences selected 
randomly from a set of N sequences with length L. In this (/, d) 
model, each planted variant of the motif with length / had exactiy 
d mismatches with the motif itself. 

In the first group of experiments, we tested the performance of 
these algorithms on (/, d) sample sets without noise sequences (i.e., 
Q = N) at standard settings, where the number N and the length L 



of sequences are set to 20 and 600, respectively [14,16,19]. These 
test results are shown in Table 1. 'WEEDER^)' indicates the 
execution time of WEEDER given the occurrence frequency 
threshold q.' — ' indicates a run time of over 10 hours, s, m, and h 
denote seconds, minutes, and hours respectively. The number 
after each run time is the ranking number of a true planted motif 
among the top 25 predicted motifs. '/' after a run time indicates 
that the real motifs were not in the top 25. In the second group of 
experiments, we first tested the influence of the ratio of noise 
sequences a (a = (N — Q)/N) on (/, d) samples using FMotif with 
typical settings (i.e., A = 20 and L = 600). In order to provide a 
more comprehensive comparison of the calculation speed when 
noise was added, we compared FMotif to SPELLER and MITRA 
on (10, 2), (11, 2), (12, 3), and (13, 3) samples. We avoided 
comparisons over motifs more complicated than (13, 3) because 
SPELLER lacked computational efficiency on these problems and 
WEEDER required tuning of parameter q. These test results are 
shown in Table 2, where a is set at 5%, 10%, 15%, • • •, and 40%, '/' 
indicates that the real motifs were not in the top 25, and the first 
line for (10, 2), (11, 2), (12, 3) and (13, 3) is the FMotif result, the 
second line (denoted by 'M..') is the MITRA result, and the third 
line (denoted 'S..') is the SPELLER result. We then tested the 
influence of the noise ratio a on samples with A =1000 and 
L= 100 to simulate ChlP-enriched regions because those regions 
are usually relatively short and the number of regions is usually 
large. We subsequently compared FMotif to SPELLER and 
MITRA on (10, 2), (11, 2), (12, 3), and (13, 3) samples as before. 
These test results are shown in Table 3, where a is set at 
10%, 20%, • • •, and 80%. In the third group of experiments, we 
tested FMotif scalability using two groups of samples to see 
whether it was suitable for recognizing motifs in large-scale ChlP- 
enriched regions. The settings of the first group were L=100, 
N= 1000,2000, • • • ,8000 and no noise sequences (oc = 0%). These 
test results are shown in Table 4. The settings of the second group 
were L=100, N= 1000,2000, • • • ,8000 and a = 30% noise 
sequences in order to mimick ChlP-seq data. These test results 
are shown in Table 5. All experiments were performed on a 
computer with an Intel 2.99 GHz processor, 2.00GB of main 
memory, and the Windows XP operating system. 

The results in Tables 1-5 lead to three observations. First, 
FMotif is a fast and exact algorithm and capable of finding (/, d) 
motifs in synthetic samples without being given the length / of a 
predicted motif. It performs faster than SPELLER, MITRA, and 
WEEDER without sacrificing accuracy. As mentioned above, 
WEEDER's efficiency suffers significandy (see (14, 4) in Table 1 
for an example) when the occurrence frequency threshold q is too 
low, and MITRA requires that the length / be specified a priori. It 
should be noted that FMotif ranked all motifs with different 
lengths together by significance score. For the samples whose true 
motifs were not ranked in the top 25, the top motifs were usually 
(/— 1)- or (1 — 2)- substrings of the true motifs with length /. In 
these cases the true motifs were still in the output list but were 
ranked below the top 25. Second, noise sequences have a strong 
effect on the results and the speed of the method. With an increase 
in a, the run time increases as well. Like the degenerative ratio 
T = d/l, the ratio of noise sequences a also weakens motifs, 
especially when background sequences are long (see Table 2). 
Spurious motifs in background sequences bury the authentic 
signals when either T or oc is large. For example, real motifs were 
difficult to filter by their significance score for the (15, 5) motif in 
Table 2, even when the noise sequence ratio was set to 5%. In this 
case, many spurious motifs of length 14-15 with a large 
significance score were ranked among the top 25. When the 
length of background sequences was shorter and the number of 
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Table 2. Results for noise-influenced models on (/, d) samples with N = 20, L = 600, and a = 5%, 10%, ■ ■ ■, 40% noise sequences. 
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1 .63.S-5 
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5.23.S-15 
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38.99.s-1 
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4.69)))-21 


/ 
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50.58.s-1 


57.14.s-1 


1.12m-1 
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9.80m- 1 


17.71m-l 


27.07m-1 


36.1 2m- 1 


45.80m-1 


58.33m-1 


1 .30A-1 


1.79/1-1 


(14, 4) 
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2.33m-1 
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(15,4) 
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(15, 5) 
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29.52m-1 
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/ 
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/ 
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The ratio of noise sequences a is set at 5%, 10%, 15%, • • ■, and 40%, '/' indicates that the real motifs were not in the top 25. The number after each run time is the ranking 
number of a true planted motif among the top 25 predicted motifs. The first line for (1 0, 2), (1 1 , 2), (12, 3) and (13, 3) is the FMotif result, the second line (denoted by 
'M..') is the MITRA result, and the third line (denoted 'S..') is the SPELLER result, .s, m, and /) denote seconds, minutes, and hours respectively. 
doi:1 0.1 371 /journal.pone.0086044.t002 



sequences was larger, the signals were stronger and could be easily 
identified even if a large portion of noise sequences was added (see 
Table 3). This is consistent with the previous result that false- 
positives could be reduced by decreasing the sequence length or by 
adding more sequences to the data set [37]. Third, as shown in 
Tables 4 and 5, FMotif is capable of operating on a large scale 
even when there are 30% noise sequences in samples. This allowed 
us to use FMotif to process peak regions within ChlP-seq and 
ChlP-exo data sets. 

Additionally, we compared FMotif with CisFinder, which uses 
k-mer enumeration with /c-mer clustering to find motifs in large- 
scale ChlP-seq peak regions. Using both algorithms, we verified 
the accuracy of FMotif and CisFinder by searching for long (/, d) 
motifs in synthetic sample sets with 3000 sequences, each of which 
contained a planted variant of a parent motif. The experimental 
results are shown in Table 6, where 'Planted Motif indicates a 
planted motif consensus in a set of sequences, 'FMotif (Top-1)' 
indicates the top ranked motif consensus found by FMotif in a 
sample set, 'CisFinder' indicates the closest matching motif 
consensus (described by IUPAC nucleotide codes) found by 
CisFinder in a sample set, indicates the number of variants 
of a reported motif found by FMotif or CisFinder in a sample set, 
and 'Rank' after '# — ' is the ranking number of the reported motif 
found by Cisfinder in Table 6. 

We used the site-level sensitivity (sSn) and positive predictive value 
(sPPV) metrics described by Tompa [24] to statistically quantify 
the accuracy of the two methods, where sSn = sTP / (sTP -\- sFN) 
and sPPV = sTP /(sTP + sFP), sTP is the number of known sites 
overlapping predicted sites, sFN is the number of known sites not 
overlapping predicted sites, and sFP is the number of predicted 
sites not overlapping known sites. A predicted site overlaps a 
known site if they share at least a half of the length of known sites. 
In order to give a more comprehensive comparison of the 



accuracy of the two methods on simulated ChlP-seq data sets, we 
added 30% noise sequences to samples with N = 3000 and L = 100 
and performed the experiments again. These test results are shown 
in Table 7. 

As evident from Tables 6 and 7, FMotif is an exact algorithm. It 
reported all true motif consensuses and their planted variants plus 
false positive variants in background sequences. CisFinder 
performed quickly but suffered from low accuracy (due to low 
sensitivity), especially when i = d/l was large. FMotif and CisFinder 
both were robust after 30 noise sequences were added to the 
samples. It should be pointed out that, although there are various 
resources on CisFinder's website (http:/ /lgsun.grc. nia.nih.gov/cis- 
fmder/download.html), we used only the motif-finding program. 
There are other programs focused on motif clustering, motif 
improvement, motif comparison, and other tasks. If all programs 
were used together, a better motif and more of its binding sites 
may be identified. However, the CisFinder algorithm [34] was 
implemented in that motif-finding program and there was no 
direct way to use all these programs together based on our 
knowledge. 

Experimental Results Using ChlP-seq Data Sets 

We tested FMotif using 12 mouse ChlP-seq data sets for 12 
DNA-binding TFs (CTCF, cMyc, Esrrb, Klf4, Nanog, nMyc, 
Oct4, Smadl, Sox2, STAT3, Tcfcp2Il, and Zfx) involved in 
mouse embryonic stem cell pluripotency and self-renewal [40]. 
These ChlP-seq data sets have been deposited in the GEO 
database with ID number GSE1 1431. We also tested FMotif using 
four widely used human ChlP-seq data sets for four DNA-binding 
TFs including CTCF (CCCTC-binding factor [41], named 
CTCF(/j) in the paper), FoxAl (hepatocyte nuclear factor 3oe 
[42]), NRSF (neuron-restrictive silencer factor [2]), and STAT1 
(signal transducer and activator of transcription protein [1]). The 
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Table 3. Results for noise-influenced models on (/, d) samples with N= 1000, L= 100, and a = 10%, 20%, • • •, 80% noise 
sequences. 



(14) 


a = 10% 


a = 20% 


a = 30% 


a = 40% 


a = 50% 


a = 60% 


« = 70% 


a = 80% 


(10,2) 


6.39.S-1 


7.635-1 


1 0.36.S-1 


12.95.s-1 


13.63.s-1 


14.64.s-1 


23.13.s-1 


24.84.s-6 


M.. 


12.44.s-1 


12.49.s-1 


1 9.38.S-1 


31.61.s-1 


31.97.s-1 


32.29.v-1 


1.47/77-1 


1 .64///-1 


S.. 


1 .23/77-1 


2.29///-1 


6.84//1-1 


10.68///-1 


12.72/77-1 


1 5.03177-1 


1 .05/7-1 


1.23/1-1 


(11,2) 


6.41.S-1 


7.70J-1 


10.28.s-1 


12.59.s-1 


13.60.s-1 


14.66.s-1 


23.01 J- 1 


24.86.s-2 


M.. 


12.41.s-1 


12.66.s-1 


1 8.58.S-1 


31.44.s-1 


31.86.s-1 


32.16.s-1 


1.41///-1 


1 .64177-1 


S.. 


1 .03///-1 


2.09/77-1 


5.86iii-1 


10.64/77-1 


12.74/77-1 


1 5.08/n-l 


1 .05//-1 


1.23/1-1 


(12,3) 


1.19/W-1 


1.44/77-1 


I.661/1-I 


2.40///-1 


2.94/77-1 


3.25//1-1 


3.56///-1 


/ 


M.. 


2.44///-1 


2.84/77-1 


2.87iii-1 


4.55///-1 


7.92/77-1 


7.98171-1 


8.63///-1 


25.91 7//-1 


S.. 


15.45/77-1 


32.14/77-1 


45.25i?7-1 


2.22/7-1 


3.80/7-1 


4.42A-1 


6.00//-1 


22.34//-1 


(13,3) 


1.18/B-1 


1 .43/71-1 


I.661/1-I 


2.33//1-1 


2.91/77-1 


3.50iii-1 


3.56///-1 


5.72/77-10 


M.. 


2.33///-1 


2.84m-1 


2.87ni-1 


4.31//1-1 


7.90/77-1 


7.96171-1 


8.41 7Z7-1 


25.90///-1 


S.. 


15.39m- 1 


32.30/77-1 


46.201/1-1 


2.20//-1 


3.82//-1 


4.40/1-1 


5.92/7-1 


22.25//-1 


(14,4) 


10.03//1-1 


16.35/77-1 


19.171/1-1 


21.63///-1 


30.65/77-1 


42.051/7-1 


45.05///-1 


/ 


(15,4) 


10.07m-1 


16.36/17-1 


19.19iii-1 


21.78/7/-1 


30.67/11-1 


42.02/77-1 


45.04/7/-1 


/ 


(15,5) 


1.69/7-1 


2.23//-1 


3.84/1-1 


4.66//-1 


5.18/1-1 


7.38A-1 


10.60//-13 


/ 


(16,5) 


1. 70A-1 


2.24A-1 


3.89/1-1 


4.67/7-1 


5.20/7-1 


7.38/1-1 


10.61//-3 


/ 





The ratio of noise sequences a is set at 10%,2Q%, 
doi:1 0.1 371 /journal.pone.0086044.t003 



, and 80%. Row definitions are the same as those in Table 2. 



raw sequence of the FoxAl ChlP-seq data set was downloaded 
from http://liulab.dfci.harvard.edu/MACS/Sample.html. The 
bed format of the CTCF, NRSF and STAT1 ChlP-seq data sets 
was downloaded from http://dir.nhlbi.nih.gov/papers/lmi/ 
epigenomes/sissrs/. These downloaded short reads were mapped 
onto the newest version of mouse genome assembly mm 10 and 
human genome assembly hgl9, respectively. The peak regions 
were extracted from these reads using the peak finding program 
MACS [39] with a false discovery rate (FDR) threshold of 0.2. The 
reads were ranked by their FDR if a negative control was 
available, or by //-value otherwise. To prepare the data sets for use 
with motif discovery algorithms, we mapped the summits of the 
ChlP-seq peaks and extracted the 100 bps of genomic sequence 
centered around each peak. 

In order to facilitate a fast motif search, avoid the potential 
influence of false positive peaks, and reduce false positive motifs in 
background sequences [37], we ran FMotif on the first 3000 ChlP- 
enriched genomic sequences and then scanned for potential 
binding locations in the remaining genomic sequences with the 
degenerative value d. Since binding sites could exist on either 
DNA strand and CisFinder searched both, we counted the 
instances of a predicted motif and those of its reverse complement 
motif. We then compared FMotif with CisFinder and published 
motifs [2,40,42-44] in literature. The experimental results are 
shown in Figures 1 and 2, where 'Nb' indicates the number of 
peak-enriched regions predicted by the peak-calling program 
MACS with an FDR threshold of 0.2 or a //-value threshold of 
10~ 5 , 'FMotif and 'CisFinder' indicate the closest matching motif 
logos found by these programs (all motif logos were generated 
using the web-based tool Weblogo [45]), 'Literature' indicates the 
corresponding motif logos published in literature, '#' indicates the 
number of binding sites found by either FMotif or CisFinder, and 
'Rank' after — ' is the ranking number of a reported motif found 
by either FMotif or CisFinder. 

We compared predicted and published motifs using a motif- 
level accuracy measure called the performance coefficient 



I U Pi f ll/ll U(J V\\, where U is a predicted motif consensus and 
V is the motif consensus published in literature [19]. As shown in 
Figures 1 and 2, the motif logos found by FMotif were more 
accurate compared with published logos from literature than those 
found by CisFinder. Furthermore, FMotif identified more TFBS 
locations than CisFinder. As for the 12 mouse TFs DNA-binding 
logos in [40], Chen et al. used the motif discovery algorithm 
WEEDER and subsequently extended the motifs using an 
expectation-maximization method. This second step was necessary 
because the supplied version of the WEEDER algorithm limited 
the motif search to a maximum of 12 bps. As discussed in the 
previous sections, WEEDER operated with low efficiency for long 
motifs and was difficult to tune for the parameter q. 

To estimate the robustness of our sampling strategy, we ran 
FMotif on the first 500, 1000, 1500, and 5000 sequences and 
the full-size ChlP-enriched genomic sequences of TFs n-Myc, 
Oct4, and NRSF. For all subsets and the full-size data sets, each of 
the corresponding motifs in Figures 1 and 2 was ranked within the 
top 25 motifs predicted by FMotif. The ranking number of 
reported motifs increased with subset size and tended to be stable 
when the size was greater than 1000. All potential binding sites of 
reported motifs were obtained from subsets and discovered during 
the scanning step. Thus, it was not necessary to run a motif-finding 
algorithm on the whole ChlP-seq data sets, especially when data 
sets were very large. Additionally, we tested FMotif on N 
randomly selected sequences (A = 500, 1000, 1500, and 
3000). These experiments were repeated 10 times. We then 
compared these results to those of the first N sequences and those 
of the last N sequences for TFs n-Myc, Oct4, and NRSF. In 
general, motif consensuses predicted from the first N sequences 
were the most similar to published motifs and ranked highest in 
the final output. Those predicted from randomly selected N 
sequences tended to be ranked second, while those predicted from 
the last N sequences were usually ranked the lowest. Furthermore, 
for the same reported motif of a TF, the number of binding sites 
found in the first N sequences was significantly greater than that 
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Table 4. A demonstration of FMotif scalability on (/, d) samples for N= 1000,2000, ■ ■ ■ ,8000 sequences, L= 100, and no (ot=0%) 
noise sequences. 





1000 


2000 


3000 


4000 


5000 


6000 


7000 


8000 


(10, 2) 


3.345-1 


8.055-1 


11.90.s-1 


15.805-1 


19.58.s-1 


23.34.s-1 


26.835-1 


42.81.s-1 


(11, 2) 


3.365-1 


8.085-1 


11.86.s-1 


15.61.s-1 


19.45.v-1 


23.16.s-1 


26.845-1 


39.25.v-1 


(12, 3) 


33.41.s-1 


1.31/n-l 


1 .84m- 1 


2.38///-1 


2.78/H-1 


3.32m-1 


4.03///-1 


4.46in-1 


(13, 3) 


34.625-1 


1.30/n-l 


1 .85m- 1 


2.40m-1 


2.84/n-l 


3.32m-1 


3.82///-1 


4.33m-1 


(14, 4) 


4.51m-1 


10.22m-1 


15.14m- 1 


19.93m- 1 


24.85m-1 


29.54m-1 


34.25//1-1 


39.03m-1 


(15, 4) 


4.52/n-l 


10.301H-1 


15.23m- 1 


20.05m-l 


24.87iii-1 


29.16m- 1 


34.38//1-1 


39.26iii-1 


(15, 5) 


35.06/n-l 


1.46/1-1 


2.15//-1 


2.68/1-1 


3.21/1-1 


3.75//-1 


4.27/t-l 


4.77/1-1 


(16, 5) 


35.02/11-1 


1.47A-1 


2.13//-1 


2.70/1-1 


3.24/1-1 


3.84//-1 


4.27/1-1 


4.75/1-1 





The number after each run time is the ranking number of a true planted motif among the top 25 predicted motifs. 5, m, and // denote seconds, minutes, and hours 
respectively. 

doi:1 0.1 371 /journal.pone.0086044.t004 



found in randomly selected N sequences. The number of 
predicted binding sites found in the last N sequences was the 
lowest. In some cases there was no corresponding motif in 
randomly selected N sequences or in the last N sequences when 
employing the same parameter settings. This situation occurred 
more often when using the last N sequences. Therefore, we 
decided that the first N sequences with the lowest p-value or FDR 
(i.e., the most ChlP-enriched sequences) were the best choice for 
drawing statistical conclusions about a corresponding motif. This is 
because, as discussed in the Introduction section, the first N 
sequences were the least affected by noise. We selected the first 
3000 peak regions to be sure that the selected subsets were large 
enough to account for the specificity of TF-DNA binding and to 
exclude false positive motifs. The same results may be obtained by 
running the algorithm on the first 1000-2000 sequences and then 
scanning potential locations in the remaining sequences. 

FMotif Sensitivity 

To test the sensitivity of FMotif, we ran it on an NRSF-positive 
TFBS set (NRSF/qPCR), which was composed of 83 binding sites 
verified by qPCR [2] . We then ran FMotif on four yeast DNA- 
binding TFs (Rebl, Gal4, Phdl, and Rapl) and one human TF 
(CTCF) ChlP-exo data sets. Raw sequence of the five ChlP-exo 
data sets are available from the NCBI Sequence Read Archive 
with accession number SRA0044886 [4]. Since it is thought 
that > 98% of ChlP-exo peak regions contain one recognizable 



DNA-binding motif within tens of bps away from peak summits, 
these can be viewed as positive TFBS sets. We used the five ChlP- 
exo peaks reported in Data S 1 from Rhee and Pugh [4] . Similarly, 
we mapped the summits of ChlP-exo peaks and extracted 50 bps 
of genomic sequence centered around each peak in yeast genome 
sacCer3 and human genome hgl9, respectively. This allowed us to 
avoid peak regions overlapping with each other due to some of the 
summits of ChlP-exo peaks being very close together. Results from 
CisFinder and published motifs [2,4,43,46-48] in literature are 
shown for comparison (see Figure 3). 

As shown in Figure 3, FMotif was capable of finding more 
matching motifs and true TF-binding locations when compared to 
CisFinder. For example, 76 true binding sites of NRSF/ qPCR 
were predicted exactiy by FMotif. On the same data set (NRSF/ 
qPCR), MICSA [38] using MEME reported only 55 sites. This 
highlights the fact that FMotif is capable of identifying TF-binding 
locations with high sensitivity. It is well-established that specificity 
is an important consideration for this type of method. However, 
the ChlP-exo technique is a high-throughput approach, and the 
resolution of binding regions identified by ChlP-exo may still be 
tens of base pairs from where the true binding sites of between 8 
and 25 base pairs are located. In addition, some of those binding 
regions are false positives, and it is difficult to say which ones are 
truly false positives without carefully designed wet-lab experi- 
ments. However, we show the specificity (i.e., sPPV) of FMotif for 
artificial samples in Tables 6 and 7. From this information we 



Table 5. A demonstration of FMotif scalability on (/, d) samples for N- 
sequences. 



: 1000,2000, ■ ■ ■ ,8000 sequences, L= 100, and a = 30% noise 



(14) 1000 2000 3000 4000 5000 6000 7000 8000 



(10, 2) 


10.33.v-l 


26.19.v-1 


45.16.v-1 


I.IOm-1 


l.63m-1 


2.09m-1 


2.70///-1 


2.97/H-1 


(11, 2) 


10.33.v-1 


25.89.v-1 


45.30.v-1 


1.09m-1 


1.49m-1 


1.90m-1 


2.43/1/-1 


2.88m-1 


(12, 3) 


1.66m-1 


4.13m-1 


7.26m-1 


10.70m-1 


1 4.07m- 1 


17.52m-1 


21.40m- 1 


25.76/H-1 


(13, 3) 


1.66m-1 


4.1 5m- 1 


7.27m-1 


10.72/I/-1 


1 4.06m- 1 


17.56m-1 


21.51iii-1 


25.94(11-1 


(14, 4) 


19.1 4m- 1 


45.85m-1 


1.28//-1 


1.90//-1 


2.59/1-1 


3.24//-1 


3.88/1-1 


4.52//-1 


(15,4) 


19.1 6m- 1 


45.81 m-1 


1 .29/1-1 


1.91A-1 


2.59//-1 


3.25/1-1 


3.89/1-1 


4.54/1-1 


(15, 5) 


3.86//-1 


8.83//-1 


14.70/1-1 


21.28//-1 


28.40//-1 


35.48/1-1 


43.08/1-1 


50.36A-1 


(16, 5) 


3.87A-1 


8.61 //-1 


14.64//-1 


20.81//-1 


28.52A-1 


35.15//-1 


43.35/1-1 


51.22/1-1 



Row and column definitions are the same as those in Table 4. 
doi:1 0.1 371 /journal.pone.0086044.t005 
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Table 6. Comparisons between FMotif and CisFinder on large (/, d) samples with L = 100, N = 
sequences. 



A Fast (/, d) Motif Finding Algorithm 
3000, and no (a=0%) noise 



(14) 


Planted Motif 


FMotif (Top-1) 


CisFinder i -Rank) 






(#)-(sSn)-(sPPV) 


(#-Rank)-(sSn)-(sPPV) 


(10, 2) 


CACGAGAACC 


CACGAGAACC 


CACGANAACC 






(3108)-(1.0)-(0.97) 


(66-1)-(0.02)-(0.98) 


(11, 2) 


TTGACAAGGAT 


TTGACAAGGAT 


TTVACAASGA 






(3026)-(1.0)-(0.99) 


(186-1)-(0.06)-(0.96) 


(12, 3) 


TCCATTAGGTGG 


TCCATTAGGTGG 


CCWMCTAABKGAMC 






(3089)-(1.0)-(0.97) 


(80-1)-(0.02)-(0.93) 


(13, 3) 


CGATAGGTCTATG 


CGATAGGTCTATG 


ATAGKYCTA 






(3026)-(1.0)-(0.99) 


(148-1)-(0.05)-(0.96) 


(14, 4) 


AGCTATCTATTTAA 


AGCTATCTATTTAA 


TAAANWGATA 






(3161)-(1.0)-(0.95) 


(75-1)-(0.02)-(0.85) 


(15, 4) 


GATCACACGGAAACC 


GATCACACGGAAACC 


CACACGGAAAC 






(3022)-(1.0)-(0.99) 


(109-3)-(0.04)-(0.98) 


(15, 5) 


GGTGGGGCGGGCGAT 


GGTGGGGCGGGCGAT 


CMGGYYGGGKCG 






(3371)-(1.0)-(0.89) 


(40-1)-(0.01)-(0.77) 


(16, 5) 


GAGGCTTGTAAACGTT 


GAGGCTTGTAAACGTT 


GGMGKGTAAAMGTTKC 






(3062)-(1.0)-(0.98) 


(59-1)-(0.02)-(0.85) 



'Planted Motif indicates a planted motif consensus in a set of sequences, 'FMotif {Top-1)' indicates the top ranked motif consensus found by FMotif in a sample set, 
'CisFinder' indicates the closest matching motif consensus (described by IUPAC nucleotide codes} found by CisFinder in a sample set, '#' indicates the number of 
variants of a reported motif found by FMotif or CisFinder in a sample set, and 'Rank' after '# — ' is the ranking number of the reported motif found by Cisfinder. The site- 
level sensitivity [sSn) and positive predictive value {sPPV) metrics described by Tompa [24] were used to statistically quantify the accuracy of the two methods, where 
sSn = sTP /(sTP+sFN) and sPPV = sTPj{sTP+sFP), sTP is the number of known sites overlapping predicted sites, sFN is the number of known sites not 
overlapping predicted sites, and sFP is the number of predicted sites not overlapping known sites. A predicted site overlaps a known site if they share at least a half of 
the length of known sites. 
doi:1 0.1 371 /joumal.pone.0086044.t006 



Table 7. Comparisons between FMotif and CisFinder on large (/, d) samples with L= 100, TV = 3000, and a = 30% noise sequences. 



(14) 


Planted Motif 


FMotif (Top-1) 


CisFinder 






(#)-(sSn)-(sPPV) 


(#-Rank)-(sSn)-(jPP V) 


(10, 2) 


TGACCCCACG 


TGACCCCACG 


YHGAYCHMACGSM 






(2192)-(1.0)-(0.96) 


(65-2)-(0.03)-(0.89) 


(11, 2) 


GCGGTGTACCA 


GCGGTGTACCA 


GCGGTNTACC 






(2130)-(1.0)-(0.99) 


(120-2)-(0.06)-(0.99) 


(12, 3) 


CACGGGCCTTAG 


CACGGGCCTTAG 


CAKSGGCCBBAG 






(2182)-(1.0)-(0.96) 


(61-2)-(0.03)-(0.85) 


(13, 3) 


TTCAGTAAGCACG 


TTCAGTAAGCACG 


TTCRGTAARCAYG 






(2124)-(1.0)-(0.99) 


(99-1)-(0.05)-(0.96) 


(14, 4) 


GCAAGTCACCGTGT 


GCAAGTCACCGTGT 


RVAAGTWBNGTGT 






(2167)-(1.0)-(0.97) 


(42-2)-(0.02)-(0.90) 


(15, 4) 


AAGGTGTTGGTATGG 


AAGGTGTTGGTATGG 


AARGTGTTGGTATGGG 






(2137)-(1.0)-(0.98) 


(70-2)-(0.03)-(0.90) 


(15, 5) 


AATACTGTGCATGGA 


AATACTGTGCATGGA 


AATWCTGTSCA 






(2272)-(1.0)-(0.92) 


(27-1H0.01M0.70) 


(16, 5) 


AGCTTGCCAGCGACGT 


AGCTTGCCAGCGACGT 


VGCTSKCCAGCWACGT 






(2145)-(1.0)-(0.98) 


(51-1)-(0.02)-(0.90) 



Column definitions are the same as those in Table 6. 
doi:1 0.1 371 /journal.pone.0086044.t007 
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Data Set (Nb) 


FMotif' (#-Rank) 


CisFinder (#-Rank) 


Literature [40] 


c-Myc (2840) 


S L — ( m yv ■ I T*. — 

' • * • - • 1 ~ (3211-1) 


'IcuAC 


cCCi.c 


CTCF (46872) 


' (37308-1) 


'l CCA tA A c 

J-WtoA^s- (10011.2) 




u 


fc... 


Esrrb (51784) 


"L<- tCAaGGtCA 

• (34789-1) 


■UAA tCA 

c^MVJvriVH ;;( . 185 _ n 




1 


Klf4 (17409) 


J---?— =■ <?? (14800-1) 


J gGGLGtGg^ 

J , .... . 4» . . . ( 277 . i) 


GGGeGiGC 


Qq 


Nanog (8367) 


2 

A 1 1 — w~ _ 1 , . A A A 

All TtA| t CAAH 

' "~ *-=-*^V¥ rl . (48i8_i) 


! LTT aT aL 

J* A I ! ,WMf?vvW, (291-9) 








n-Myc (8032) 


... -— .---^ (8113-1) 


! iCaC VtG 

• - ' T ^ ■ ■_, (577-15) 






Oct4 (3775) 


'TIT -i- aT i~a»A 


•1 T TT T AT cA 

1 .11! m 3i (4 io_i) 




l.c, T 


Smadl (1188) 


J 

° L-tT a A A a A a a 

J^TI^^M^ (997 .2) 


! ] T aCCTTg 

A — ! ! (72-1) 


silk 






Sox2 (4593) 


" L~A 1 1 T fll PAaA 

^i!4l^i*?#t A , (2133.1) 


-.?TT, (4 i9_i) 


?J1.I 1.c.;..t 


STAT3 (2272) 


iTtcC GgaA 


'TtCC GQaA 

,J.T?^a T $M (273 _ i) 


-TTOCLgGU 


Tcfcp2Il (24635) 


C<F. tTcaAaCc 

.U .T TT »nv, (8165-1) 


iA.W5.ttfe._i (1724.1) 




Zfx (19580) 


'1 CC. CC 

^ = . = - - . . i , (34059-1) 


i™-"- (828-1) 





Figure 1. Motifs in 12 mouse ES Cell ChlP-seq data sets. FMotif was tested using mouse ChlP-seq data sets for 12 DNA-binding TFs (CTCF, 
cMyc, Esrrb, Klf4, Nanog, nMyc, Oct4, Smadl, Sox2, STAT3, Tcfcp2l1, and Zfx) involved in mouse embryonic stem cell pluripotency and self-renewal 
[40]. Results from CisFinder and published motifs in literature are shown for comparison. 'Nb' indicates the number of peak-enriched regions 
predicted by the peak-calling program MACS with an FDR threshold of 0.2 or a ;;-value threshold of 10~ 5 , 'FMotif and 'CisFinder' indicate the closest 
matching motif logos found by these programs (all motif logos were generated using the web-based tool Weblogo [45]), 'Literature' indicates the 
corresponding motif logos published in literature, '#' indicates the number of binding sites found by either FMotif or CisFinder, and 'Rank' after '# — ' 
is the ranking number of a reported motif found by either FMotif or CisFinder. 
doi:1 0.1 371 /journal.pone.0086044.g001 
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Figure 2. Motifs in 4 human TF ChlP-seq data sets. FMotif was tested with four widely used human ChlP-seq data sets for four DNA-binding TFs 
including CTCF (CCCTC-binding factor [41], named CTCF(/i)), FoxAl (hepatocyte nuclear factor 3a [42]), NRSF (neuron-restrictive silencer factor [2]), 
and STAT1 (signal transducer and activator of transcription protein [1]). Results from CisFinder and published motifs in literature are shown for 
comparison. Column definitions are the same as those in Figure 1. 
doi:1 0.1 371 /journal.pone.0086044.g002 
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conclude that FMotif has both a higher sensitivity and a higher 
specificity than CisFinder. 

Discussion 

In this study, we have proposed a new and fast heuristic 
enumeration method, FMotif, for extracting motifs from sequenc- 
es. We have used this method to identify motifs and their binding 
locations in widely-used large-scale ChlP-seq and ChlP-exo data 
sets by combining FMotif with a peak-enriched sampling strategy. 
Our empirical studies have shown that this algorithm is fast and 
exact when searching for motifs in (/, d) samples and has achieved 
good performance when identifying motifs in ChlP-enriched 
regions. In addition, the ChlP-enriched sampling strategy worked 
well on large-scale ChlP-seq and ChlP-exo data sets. It not only 
allowed us to exclude both noise occurring in lower ChlP-enriched 
peak regions and false positive motifs contained in background 
sequences, but also let us take advantage of well-developed motif- 
finding tools with low-level scalability. However, it should be 
pointed out that, in general, no method can outperform others 
under all conditions. FMotif performed faster than SPELLER, 
WEEDER, and MITRA but used more memory to store mis- 
matched information in suffix trees, and FMotif was much more 
accurate but much slower than CisFinder. FMotif does, however, 
provide a good trade-off between time, space, and accuracy. 

Motif discovery has been a popular area of study for more than 
two decades. Many successful motif-finding programs have been 
developed. The programs are ideal for finding motifs in tens or 



hundreds of promoters of co-regulated or homologous genes and 
for extracting motifs in genome-wide ChlP-enriched regions 
contained in large-scale ChlP-chip, ChlP-seq, and ChlP-exo data 
sets. Still, the problem is far from solved due to diversity in gene 
expression/regulation and the low specificity of binding sites. With 
the advance of high-throughput and high-resolution sequencing 
techniques like ChlP-exo, researchers have an increasing number 
of tools for studying gene regulation on a genomic scale. This will 
make the motif-finding problem easier to solve. Using advanced 
techniques such as ChlP-exo, it is possible to acquire new 
knowledge of regulatory binding sites. This will not only be 
beneficial for understanding the mechanisms of gene regulation, 
but also for creating a proper computational model that will 
replace (/, d) models and PWM matrix profiles for motif 
representation. 

Materials and Methods 

The (/, d) Motif Search and Suffix Tree 

A transcription factor binds to specific DNA sequences and is 
involved in controlling the transcription of genetic information 
from DNA to mRNA. The actual DNA regions bound by a TF 
usually range in size from 8—10 to 16-20 bps and display a short 
motif, but differ by a few nucleotides from one another. The 
computational problem is to determine such a motif by analyzing a 
set of sequences that contain instances of the motif. 

In current literature, there are two main approaches to motif 
representation. The first involves using a motif profile character- 
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Figure 3. FMotif sensitivity. FMotif sensitivity was measured using an NRSF-positive TFBS set (NRSF/qPCR), which was composed of 83 binding 
sites verified by qPCR [2], four yeast DNA-binding TFs (Reb1, Gal4, Phd1, and Rapl), and one human TF (CTCF) ChlP-exo data sets. Results from 
CisFinder and published motifs in literature are shown for comparison. Column definitions are the same as those in Figure 1. 
doi:1 0.1 371 /journal.pone.0086044.g003 
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ized by a PWM [py.il/x4- The PWM records the probability of an 
observed nucleotide k (ke{A,C,G,T}) at position j 
(/' = {1,2, • • • ,/}) for all aligned sites, where / is the length of the 
motif. Numerous programs have been developed to maximize the 
score of a PWM by measuring, for example, the information 
content of a PWM: 

IC=-Y^Piflog(pij/p b ) 

hi 

where pt, is the background frequency for the nucleotide, which 
measures motif conservation [7] . Using the second approach, one 
can characterize a motif as an /-length consensus string and 
describe it using the most frequent nucleotide in each position of 
all aligned sites under the assumption that each sequence contains 
zero or one motif instance with up to d or exactly d mutations 
within the motif. Finding (/, d) motifs with exactly d mutations is 
more challenging than finding (/, d) motifs with up to d mutations, 
and algorithms designed for the former can usually be directly 
used to find the latter. In addition, profile-based optimization 
methods, e.g., CONSENSUS and MEME, have failed to find (/, d) 
motifs such as (15, 4), where a 15-bp motif is planted into 20 
sequences, each 600 bps in length with exactly 4 mismatches [19]. 
Thus, in this study we focus on designing a fast and exact 
algorithm to find (/, d) motifs with exactly d mutations on (/, d) 
samples. 

When searching for the exact motifs contained in an (/, d) 
sample, it is customary to perform an exhaustive search for all 
potential /-mers and verify their occurrence in the entire sample 
set. When using SPELLER and WEEDER to perform fast /-mer 
substring searching in a sequence set, a suffix tree structure is used 
to index sequences. A suffix tree presents the suffixes of a given 
string or a given set of strings in a way that allows for a very fast 
implementation of string operations. An example of a classic suffix 
tree for the string GAGAG is shown in Figure 4a. When the suffix 
tree of a string with length L is constructed, searching for a 
substring of length / (/ < P) in the string only requires time 
proportional to 0(1) instead of O(L). 

Nevertheless, it is still time consuming to perform an exhaustive 
motif search in a suffix tree of sequences because the search space 
(shown as a four-branch tree in Figure 4b) can be up to 4' in size. 
With the increase in degenerative value d, the valid instances of a 
motif in the suffix tree will increase dramatically. Therefore, 
SPELLER can handle only short (/, d) motifs with / < 1 3 and 
d<3. In order to increase the speed of SPELLER, WEEDER 
introduces an error ratio £ (s~<i//) to narrow the search space 
such that for all ie{l,2,. ..,/}, the number of mismatches between 
the first i nucleotides of a candidate /-mer motif and the first / 
nucleotides of a valid instance of the motif is at most si. The 
strategy can quickly discard /-mers in the search space that do not 
satisfy this restriction. However, not all motif occurrences satisfy 
this restriction, and therefore the real motif may be missed by the 
algorithm. WEEDER lowers the occurrence frequency q<Q<N 
to make sure that the true motif will not be missed. Still, 
WEEDER is an almost exact algorithm. What's more, with the 
decrease of q, WEEDER's run time will increase dramatically 
[14]. Therefore, a fast and exact (/, d) motif search method is 
needed. 

Mismatched Suffix Trees and FMotif 

SPELLER and WEEDER use a depth-first search to scan the 
entire pattern space. If an /-mer along the pattern tree has enough 
instances in the suffix tree of sequences, the /-mer can grow up to 4 



(/+l)-mers in the next layer of the pattern tree (see Figure 4b). 
Otherwise, the end node of an /-mer in the pattern tree will not be 
allowed to grow and the sibling nodes of the /-mer will be checked. 
If any of the sibling nodes can grow to the (/+ l)-th layer in the 
pattern tree, the search process will go down to the (/+ l)-th layer 
of the pattern tree in a depth-first manner. Otherwise, it will 
backtrack to the uncle nodes (the siblings of parent nodes) of the /- 
mer in the (/— l)-th layer of the pattern tree and so forth. The 
algorithms will end at the longest /-mer or /-mers in the pattern 
tree. The difference between SPELLER and WEEDER is that 
WEEDER reduces the number of possible instances of a motif by 
restricting its mutation locations such that the valid paths on the 
pattern tree are sharply reduced. 

We discovered that for finding motifs with length /, both 
SPELLER and WEEDER must check each /-mer (/</) in the 
pattern space with depth-first order and count the variants of the i- 
mer in the suffix tree of sequences from the root to layer /. The 
suffix tree is scanned one time for each /-mer pattern. Thus, the 
algorithms scan the suffix tree an increasing number of times with 
the increase of /. Actually, the mismatch information in layer / of a 
suffix tree can be used to search (i+ l)-mers in the pattern space. 
In this work we constructed a new suffix tree structure with 
mismatch information, called a mismatched suffix tree, for each 
sequence. Using these trees, we took advantage of the mismatch 
information in the i-th layer of the trees to verify (/+ l)-mers in the 
pattern space and then updated the mismatch information in the 
(/+ l)-th layer. In this way we were able to implement a depth-first 
search on the pattern space and mismatched suffix trees 
simultaneously, which avoided a large number of repeated scans 
on the suffix trees of sequences. 

For instance, when searching occurrences of (4, 1) motifs in the 
sequence GAG AC, we started from the root of the pattern tree 
represented as Po = © i n Figure 5a and initialized the mismatched 
suffix tree for the sequence GAG AC. We then checked the 
occurrences of pattern P\= A with up to 1 mismatch in the 
mismatched suffix tree and found that all nodes in the first layer 
have 0 or 1 mismatch(es) with Pi . Next, we updated the mismatch 
value along the valid nodes and linked all of these nodes by points 
(see Figure 5b). We subsequently performed a depth-first search 
again and arrived at the pattern P2 =AA. We updated mismatch 
information for all child nodes of the nodes in the link set in the 
first layer by using the mismatch information of those nodes in the 
link set and found all nodes in the second layer had 1 mismatch 
with the pattern P2 =AA. We updated the mismatch value along 
the valid nodes in the second layer and linked all of these nodes by 
points to form a new link set (see Figure 5c). Then, we moved to 
the pattern Pi=AAA in a depth-first manner and updated 
mismatch information of all child nodes of the nodes in the newly 
generated link set by using the mismatch information of those 
nodes in the new link set. We found that only the child node A, 
representing the 3-mer AGA from the root to node A in the third 
layer of the suffix tree, had 1 mismatch with the pattern P3 =AAA 
(see Figure 5d). Other nodes with 2 mismatches did not need to be 
updated and checked for the longer pattern AAA A. We found that 
the child node of the node A in the third layer did not satisfy the 1- 
mismatch restriction with the pattern AAAA, so we looked at the 
pattern Pn, = AAAC and found a (4, 1) occurrence of P4 (see 
Figure 5e). We then went to the patterns A A AG and A A AT and 
found no occurrence of these patterns in the sequence GAG AC. 
We backtracked to the pattern P' 3 =AAC and updated the 
mismatch information in the third layer by using the mismatch 
information of their parent nodes in link set of the second layer. 
There we found that only node C in the third layer satisfies the 
restriction (see Figure 5f), but that node C has no child. We then 



PLOS ONE I www.plosone.org 



10 



January 2014 | Volume 9 | Issue 1 | e86044 



A Fast (/, d) Motif Finding Algorithm 




(a) (b) 



Figure 4. An example of a suffix tree and a tree representation of pattern space, (a) The suffix tree of the sequence GAGAC. (b) A tree 

representation of pattern space in the search for an (/, d) motif. 

doi:10.1371/journal.pone.0086044.g004 



backtracked to pattern A AG and continued the process as before 
until we found all occurrences for each (4, 1) motif. The details of 
the pattern search and mismatched suffix tree construction are 
shown in the subroutine PatternOnTree(Pk,d,T ,List(Pk-i,T)), 
where T is the mismatched suffix tree for a sequence, Pk is the 
node currently being processed (representing a k-vatr pattern) in 
the k-th layer of the pattern tree, List(Pk-\,T) is the link set 
representing all valid occurrences of the (k— l)-mer pattern 
represented by the node Pk-i in the (k— l)-th layer of the pattern 
tree, and MMC{n^) is mismatch value of the pattern represented 
by Pk compared with the substring represented by the node rijk in 
the tree T. 

PatternOnTree(P k ,d, T ' ,List(P k _ x ,T)) 
Initialize List(P k ,T) = 0; 

for =head node to tail node of List{Pk-\,T) do 

for each rij^eNS do (NS is the child node set of the node 
n itk - 1) 
if rijk = Pk, then 

MMC(n jk ) = MMC(n itk - 1 ) 
else 

MMC(njk) = MMC(n uk _ 1 ) + 1 ; 
if MMC{n jk ) < d then 
add rijk to List(Pk,T); 
return List(P k ,T); 

MotifFinding(P k ,TS,List(Pk,TS),q,dJmax) 
Initialize str = ACGT; 
if k>l max or List(Pk,TS) = 0 then 
return; 
for j = 1 • • • 4 do 
P k +l=Pk + str\j]; 
FailureCount = 0; 

for each tree Ti in the tree set TS do 
List(P k+ i,7j) = PattemOnTree(Pk+ 1 ,d,T,,List(Pk,Ti)); 
if List(P k+u Td = 0 then 
FailureCount + +; 
if FailureCount > \ TS\ — q then 
break; 

if FailureCount < \ TS\ — q then 

Output(Pk+\); 
MotifFinding{P k+l ,TS,List(P k+ \,TS),q,d,l max ) 



For each set of sequences, we counted the number of 
occurrences of a potential pattern in all sequences instead of 
just one sequence shown in subroutine PatternOnTree(P k , 
d,T,List(Pk-i,T)). If the number of occurrences was larger than 
the threshold of occurrence frequency q, it was reported as a 
potential pattern. The subroutine for counting occurrences of a 
(k+ l)-mer pattern, represented by the node Pk+i in the (k+ 1)- 
th layer of the pattern tree, is shown in MotifFinding(Pk, 
TS,List(Pk,TS), q,dj max ), where l max is the maximum length 
allowed for a motif, TS is the set of mismatched suffix trees for all 
sequences, List(P k ,TS)=(jfL 0 List(P k ,Ti), and N is the number 
of total sequences. 

The entire process of finding motifs with at least q occurrences 
in a set of sequences is shown below. Additionally, since there may 
be many motifs that satisfy the quorum restriction q, we sorted all 
potential motifs according to their statistical significance using the 
method in [14,15]. We reported the top n significant motifs and 
their occurrences as output, where (i, j) indicates an instance of a 
motif starting at the j-th position of the i-th sequence s t . 

The FMotif Algorithm 

1) Initialize a mismatched suffix tree Tj like the one shown in 
Figure 5a, i= 1,2, • • • ,N; 

2) Initialize List{Q,Ti) = fetempnode, where fetempnode is the 
pointer of a temporary node, i = 1,2, • • • ,N; 

3) Input q,d,l max ,n; 

4) MotifFindingO, TS,List(Q, TS),q,d,l max )\ 

5) Rank the found motifs according to their significance scores; 

6) Output the top n motifs, their instances, and the positions (/,_/) 
of these instances. 

According to our empirical study, FMotif is capable of 
increasing the speed of the algorithms SPELLER and WEEDER 
without loss of accuracy. In addition, we used the WEEDER 
strategy to further decrease the search space by allowing 
mismatches occurring at most ei times with an increase in i. This 
strategy decreased FMotif s run time but caused problems during 
the tuning of the parameter q and resulted in a loss of accuracy. 
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Figure 5. An example of a (4,1) motif search using FMotif. Figures (a)-(f) illustrate the search process of (4, 1) motifs on the mismatched suffix 

tree of the sequence GAGAC. 

doi:1 0.1 371 /journal.pone.0086044.g005 
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